To optimize the quality of coffee production for the Coffee Quality Institute that sources both Arabica and Robusta beans through:
The dataset was obtained through the GitHub Repo of James LeDoux. He obtained the dataset from the database of CQI through a scraper.
The dataset was obtained from the Coffee Quality Institute’s website in January 2018 using a scraping method. It comprises four CSV files, two of which contain uncleaned raw data sourced directly from the website. Given that the raw data were manually recorded with varying encodings, abbreviations, and units of measurement for fields such as farm names, altitude, region, and others, it is necessary to perform data cleaning to enhance the accuracy of the analysis. The dataset encompasses quality measures and farmers’ information for two types of coffee beans: arabica and robusta.
To overcome limitations in the dataset, addressing the variability constraint in bean types can be achieved by collecting additional datasets covering a wider range of coffee varieties, through collaboration with other coffee institutions. The 2018 dataset can be alleviated by implementing data collection from the recent data of the Coffee Quality Institute.
install.packages("tidyverse")
library(tidyverse)
library(janitor)
arabica_cleaned_data <- read.csv("arabica_data_cleaned.csv")
robusta_cleaned_data <- read.csv("robusta_data_cleaned.csv")
str(arabica_cleaned_data)
str(robusta_cleaned_data)
summary(arabica_cleaned_data)
summary(robusta_cleaned_data)
head(arabica_cleaned_data)
head(robusta_cleaned_data)
print(data.frame(Column_Name = names(arabica_cleaned_data)))
print(data.frame(Column_Name = names(robusta_cleaned_data)))
arabica_trimmed_data <- arabica_cleaned_data %>%
select(-1, -6, -8, -10, -13, -14, -15, -16, -37, -38, -39, -40, -41)
robusta_trimmed_data <- robusta_cleaned_data %>%
select(-1, -6, -8, -10, -13, -14, -15, -16, -37, -38, -39, -40, -41)
Column_Name Reason for dropping the variable
X This column contains just column numbers
Lot.Number Irrelevant for analysis
ICO.Number Irrelevant for analysis
Altitude A Range. Has been separated for more accuracy
Number.of.Bags No significance to the analysis
Bag.Weight No significance to the analysis
In.Country.Partner No significance to the analysis
Harvest.Year Dropped, as grading date is more accurate
Expiration Irrelevant for analysis
Certification.Body Irrelevant for analysis
Certification.Address Irrelevant for analysis
Certification.Contact Irrelevant for analysis
unit_of_measurement Irrelevant for analysis
sum(is.na(arabica_trimmed_data))
## [1] 682
sum(is.na(robusta_trimmed_data))
## [1] 9
Given the substantial number of missing values, especially in the altitude variable of the dataset, I decided against removing those rows. This choice is intended to prevent the unintentional deletion of relevant information, ensuring the accuracy of my analysis.
sum(duplicated(arabica_trimmed_data))
## [1] 0
sum(duplicated(robusta_trimmed_data))
## [1] 0
It revealed the absence of any duplications in both data frames.
arabica_trimmed_data <- janitor::clean_names(arabica_trimmed_data)
robusta_trimmed_data <- janitor::clean_names(robusta_trimmed_data)
print(data.frame(Column_Name = names(arabica_trimmed_data)))
print(data.frame(Column_Name = names(robusta_trimmed_data)))
min(arabica_trimmed_data$total_cup_points)
## [1] 0
mean(arabica_trimmed_data$total_cup_points)
## [1] 82.11593
max(arabica_trimmed_data$total_cup_points)
## [1] 90.58
min(robusta_trimmed_data$total_cup_points)
## [1] 73.75
mean(robusta_trimmed_data$total_cup_points)
## [1] 80.86893
max(robusta_trimmed_data$total_cup_points)
## [1] 83.75
ggplot(data = arabica_trimmed_data) +
geom_histogram(mapping = aes(x = total_cup_points, fill = "Brown")) +
labs(
title = "Histogram of Arabica Total Cup Points",
x = "Total Cup Points",
y = "Count of Beans"
)
ggplot(data = robusta_trimmed_data) +
geom_histogram(mapping = aes(x = total_cup_points, fill = "Brown")) +
labs(
title = "Histogram of Robusta Total Cup Points",
x = "Total Cup Points",
y = "Count of Beans"
)
Upon identifying outliers, I focused into the data to investigate the root causes and determine whether they should be retained or removed. This examination aims to provide understanding for informed decision-making in subsequent analyses.
Upon review, it appears that one bean was scored incorrectly, receiving a score of zero due to an incomplete assessment. Consequently, I have opted to exclude it from the dataset to prevent potential distortion of the analysis results.
arabica_trimmed_data <- arabica_trimmed_data %>%
filter(total_cup_points > 0)
arabica_trimmed_data$grading_date <-
mdy(str_replace_all(arabica_trimmed_data$grading_date,
"\\b(\\d+)th\\b", "\\1"))
robusta_trimmed_data$grading_date <-
mdy(str_replace_all(robusta_trimmed_data$grading_date,
"\\b(\\d+)th\\b", "\\1"))
combined_coffee_data <- rbind(arabica_trimmed_data, robusta_trimmed_data)
write.csv(arabica_trimmed_data, "arabica_trimmed_data.csv", row.names = FALSE)
write.csv(robusta_trimmed_data, "robusta_trimmed_data.csv", row.names = FALSE)
write.csv(combined_coffee_data, "combined_coffee_data.csv", row.names = FALSE)
So that I can transform data from wide to long format, I installed the reshape2 package.
install.packages("reshape2")
library(reshape2)
To identify the key quality factors significantly influencing the overall quality of a coffee bean, I employed correlation calculations between the sensory values of the coffee bean and the total cup points.
I decided to compare correlation between aroma, flavor, aftertaste, acidity, body, balance, uniformity, clean cup, sweetness, cupper points and total cup points. This will be applied to arabica, robusta and their combined data for further analysis.
arabica_correlation_matrix <-
cor(arabica_trimmed_data[, c(13:22, 23)])
robusta_correlation_matrix <-
cor(robusta_trimmed_data[, c(13:22, 23)])
combined_correlation_matrix <-
cor(combined_coffee_data[, c(13:22, 23)])
Melted correlation matrix for the 3 dataframes for visualization purposes
arabica_melted_correlation <- melt(arabica_correlation_matrix)
robusta_melted_correlation <- melt(robusta_correlation_matrix)
combined_melted_correlation <- melt(combined_correlation_matrix)
To gain a deeper understanding of the data, I chose to visualize it using a heatmap. This allowed me to examine the strength of the correlation between each sensory value and the total cup points.
ggplot(arabica_melted_correlation) +
geom_tile(mapping = aes(Var1, Var2, fill = value), color = "white") +
geom_text(aes(Var1, Var2, label = paste0(round(value * 100), "%")),
size = 3) +
labs(title = "Arabica Correlation Heatmap") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1))
ggplot(robusta_melted_correlation) +
geom_tile(mapping = aes(Var1, Var2, fill = value), color = "white") +
geom_text(aes(Var1, Var2, label = paste0(round(value * 100), "%")),
size = 3) +
labs(title = "Robusta Correlation Heatmap") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1))
ggplot(combined_melted_correlation) +
geom_tile(mapping = aes(Var1, Var2, fill = value), color = "white") +
geom_text(aes(Var1, Var2, label = paste0(round(value * 100), "%")),
size = 3) +
labs(title = "Arabica & Robusta Correlation Heatmap") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1))
I calculated the mean values of each sensory quality for Arabica and Robusta beans, enabling a side-by-side comparison through visualization. Subsequently, results were consolidated into a single dataframe and were melted.
mean_arabica <- colMeans(arabica_trimmed_data[, 13:22], na.rm = TRUE)
mean_robusta <- colMeans(robusta_trimmed_data[, 13:22], na.rm = TRUE)
mean_sensory_values <- data.frame(
Attribute = names(mean_arabica),
Arabica = mean_arabica,
Robusta = mean_robusta
)
melted_mean_sensory <- melt(mean_sensory_values)
ggplot(data = melted_mean_sensory) +
geom_bar(mapping = aes(x = Attribute, y = value, fill = variable),
stat = "identity", position = "dodge", color = "white") +
labs(title = "Comparison of Mean Sensory Attributes: Arabica vs Robusta",
y = "Mean Value",
x = "Sensory Attribute",
fill = "Species") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10,
hjust = 1))
To determine the leading producers of high-quality coffee, I computed the average total cup points for each country.
arabica_quality_country <-
aggregate(arabica_trimmed_data[, 13:23],
by = list(country = arabica_trimmed_data$country_of_origin),
FUN = mean)
robusta_quality_country <-
aggregate(robusta_trimmed_data[, 13:23],
by = list(country = robusta_trimmed_data$country_of_origin),
FUN = mean)
combined_quality_country <-
aggregate(combined_coffee_data[, 13:23],
by = list(country = combined_coffee_data$country_of_origin),
FUN = mean)
arabica_quality_country <-
arabica_quality_country[order(-arabica_quality_country$total_cup_points), ]
robusta_quality_country <-
robusta_quality_country[order(-robusta_quality_country$total_cup_points), ]
combined_quality_country <-
combined_quality_country[order(-combined_quality_country$total_cup_points), ]
Following the calculation of average total cup points for each country, I proceeded to create a bar graph for a visual representation of the top-performing countries.
# for arabica
ggplot(data = arabica_quality_country) +
geom_bar(mapping = aes(x = reorder(country, -total_cup_points),
y = total_cup_points),
stat = "identity", fill = "red") +
labs(title = "Top Quality Producing Countries for Arabica",
x = "Country of Origin",
y = "Average Total Cup Points") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# for robusta
ggplot(data = robusta_quality_country) +
geom_bar(mapping = aes(x = reorder(country, -total_cup_points),
y = total_cup_points),
stat = "identity", fill = "red") +
labs(title = "Top Quality Producing Countries for Robusta",
x = "Country of Origin",
y = "Average Total Cup Points") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# for combined data:
ggplot(data = combined_quality_country) +
geom_bar(mapping = aes(x = reorder(country, -total_cup_points),
y = total_cup_points),
stat = "identity", fill = "red") +
labs(title = "Top Overall Quality Coffee Producing Countries",
x = "Country of Origin",
y = "Average Total Cup Points") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
To identify the leading coffee producers within specific countries, I filtered the data frames for Arabica, focusing on the United States, Papua New Guinea, Ethiopia, Japan, and Kenya. Subsequently, I organized the results in descending order based on the mean total cup points. However, to gain insights into consistent high-quality coffee production by species, I chose to extract the top 10 owners irrespective of their respective countries.
# Filtering the data based on the top countries producing arabica
filtered_arabica_country <- c("United States", "Papua New Guinea",
"Ethiopia", "Japan", "Kenya")
# Grouping owners by country
grouped_owners_arabica <- arabica_trimmed_data %>%
filter(country_of_origin %in% filtered_arabica_country) %>%
group_by(owner, country_of_origin) %>%
summarise(mean_total_cup_points = mean(total_cup_points, na.rm = TRUE)) %>%
arrange(-mean_total_cup_points)
grouped_owners_robusta <- robusta_trimmed_data %>%
group_by(owner, country_of_origin) %>%
summarise(mean_total_cup_points = mean(total_cup_points, na.rm = TRUE)) %>%
arrange(-mean_total_cup_points)
# Identifying top 10 owners that produces quality coffee bean
top_owners_arabica <- head(grouped_owners_arabica, 10)
top_owners_robusta <- head(grouped_owners_robusta, 10)
print(top_owners_arabica)
## # A tibble: 10 × 3
## # Groups: owner [10]
## owner country_of_origin mean_total_cup_points
## <chr> <chr> <dbl>
## 1 metad plc Ethiopia 89.8
## 2 yidnekachew dabessa Ethiopia 89
## 3 diamond enterprise plc Ethiopia 88.2
## 4 mohammed lalo Ethiopia 88.1
## 5 cqi q coffee sample repr… United States 87.3
## 6 ji-ae ahn Ethiopia 87.1
## 7 hider abamecha Ethiopia 86.2
## 8 lusso lab Kenya 86
## 9 ibrahim hussien speciali… Ethiopia 85.8
## 10 colbran coffeelands, ltd. Papua New Guinea 85.8
print(top_owners_robusta)
## # A tibble: 10 × 3
## # Groups: owner [9]
## owner country_of_origin mean_total_cup_points
## <chr> <chr> <dbl>
## 1 katuka development trust… Uganda 83
## 2 ankole coffee producers … Uganda 82.6
## 3 nishant gurjer India 82.5
## 4 ugacof Uganda 82.4
## 5 andrew hetzel India 81.6
## 6 kasozi coffee farmers as… Uganda 81.5
## 7 kawacom uganda ltd Uganda 80.9
## 8 nitubaasa ltd Uganda 80.6
## 9 mannya coffee project Uganda 80.5
## 10 andrew hetzel United States 79.3
I utilized a bar graph for a visual examination of the top 10 coffee owners for each coffee bean specie.
# top 10 coffee owners for arabica
ggplot(data = top_owners_arabica) +
geom_bar(mapping = aes(x = reorder(owner, -mean_total_cup_points),
y = mean_total_cup_points,
fill = country_of_origin),
stat = "identity") +
labs(title = "Top 10 Owners that Produce Quality Arabica Beans",
x = "Name of Owner",
y = "Average Total Cup Points") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# top 10 coffee owners for robusta
ggplot(data = top_owners_robusta) +
geom_bar(mapping = aes(x = reorder(owner, -mean_total_cup_points),
y = mean_total_cup_points,
fill = country_of_origin),
stat = "identity", position = "dodge") +
labs(title = "Top 10 Owners that Produce Quality Robusta Beans",
x = "Name of Owner",
y = "Average Total Cup Points") +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
In order to analyze the potential impact of altitude on the overall quality of a coffee bean, I calculated the correlation between all sensory values and the average altitude.
# Calculating correlation between sensory values and altitude
altitude_correlation_arabica <-
cor(na.omit(arabica_trimmed_data[, c(13:23, 31)]))
altitude_correlation_robusta <-
cor(na.omit(robusta_trimmed_data[, c(13:23, 31)]))
# Melting data frame for use in ggplot
melted_altitude_arabica <- melt(altitude_correlation_arabica)
melted_altitude_robusta <- melt(altitude_correlation_robusta)
To visually explore the correlation between sensory values and average altitude, I created a heatmap.
ggplot(melted_altitude_arabica) +
geom_tile(mapping = aes(Var1, Var2, fill = value), color = "white") +
geom_text(aes(Var1, Var2, label = paste0(round(value * 100), "%")),
size = 3) +
labs(title = "Heatmap: Sensory Values vs Average Altitude - Arabica") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1))
ggplot(melted_altitude_robusta) +
geom_tile(mapping = aes(Var1, Var2, fill = value), color = "white") +
geom_text(aes(Var1, Var2, label = paste0(round(value * 100), "%")),
size = 3) +
labs(title = "Heatmap: Sensory Values vs Average Altitude - Robusta") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1))
To have other insights, I decided to visualize it through a scatterplot
# for arabica
ggplot(data = arabica_trimmed_data) +
geom_point(mapping = aes(x = total_cup_points, y = altitude_mean_meters)) +
labs(title = "Scatterplot of Total Cup Points vs Average Altitude (Arabica)",
x = "Total Cup Points",
y = "Average Altitude")
# for robusta
ggplot(data = robusta_trimmed_data) +
geom_point(mapping = aes(x = total_cup_points, y = altitude_mean_meters)) +
labs(title = "Scatterplot of Total Cup Points vs Average Altitude (Robusta)",
x = "Total Cup Points",
y = "Average Altitude")
Following the analysis of the scatterplot results, I further examined the altitude ranges associated with the top 5 countries for both Arabica and Robusta. This additional investigation provides valuable insights into the elevation preferences associated with the production of high-quality coffee in these regions.
# Analyzing what range of altitude does the top quality producing countries have
grouped_altitude_arabica <- arabica_trimmed_data %>%
filter(country_of_origin %in% filtered_arabica_country) %>%
group_by(country_of_origin) %>%
summarise(AvgAltLow = mean(altitude_low_meters, na.rm = TRUE),
AvgAltHigh = mean(altitude_high_meters, na.rm = TRUE),
mean_total_cup_points = mean(total_cup_points, na.rm = TRUE)) %>%
arrange(-mean_total_cup_points)
grouped_altitude_robusta <- robusta_trimmed_data %>%
group_by(country_of_origin) %>%
summarise(AvgAltLow = mean(altitude_low_meters, na.rm = TRUE),
AvgAltHigh = mean(altitude_high_meters, na.rm = TRUE),
mean_total_cup_points = mean(total_cup_points, na.rm = TRUE)) %>%
arrange(-mean_total_cup_points)
print(grouped_altitude_arabica)
## # A tibble: 5 × 4
## country_of_origin AvgAltLow AvgAltHigh mean_total_cup_points
## <chr> <dbl> <dbl> <dbl>
## 1 United States 1859. 1859. 86.0
## 2 Papua New Guinea 1700 1700 85.8
## 3 Ethiopia 1702. 1899. 85.5
## 4 Japan 170 170 84.7
## 5 Kenya 1466. 1539. 84.3
print(grouped_altitude_robusta)
## # A tibble: 5 × 4
## country_of_origin AvgAltLow AvgAltHigh mean_total_cup_points
## <chr> <dbl> <dbl> <dbl>
## 1 Uganda 1330. 1380. 81.9
## 2 India 1422. 1422. 81.4
## 3 Ecuador 40 40 78.4
## 4 United States 1898. 1898. 78.2
## 5 Vietnam NaN NaN 73.8
Based on the analysis, the Coffee Quality Institute (CQI) could refine its sourcing strategy in several ways: